benchmark study
No Free Lunch from Audio Pretraining in Bioacoustics: A Benchmark Study of Embeddings
Bioacoustics, the study of animal sounds, offers a non-invasive method to monitor ecosystems. Extracting embeddings from audio-pretrained deep learning (DL) models without fine-tuning has become popular for obtaining bioacoustic features for tasks. However, a recent benchmark study reveals that while fine-tuned audio-pretrained VGG and transformer models achieve state-of-the-art performance in some tasks, they fail in others. This study benchmarks 11 DL models on the same tasks by reducing their learned embeddings' dimensionality and evaluating them through clustering. We found that audio-pretrained DL models 1) without fine-tuning even underperform fine-tuned AlexNet, 2) both with and without fine-tuning fail to separate the background from labeled sounds, but ResNet does, and 3) outperform other models when fewer background sounds are included during fine-tuning. This study underscores the necessity of fine-tuning audio-pretrained models and checking the embeddings after fine-tuning. Our codes are available: https://github.com/NeuroscienceAI/Audio\_Embeddings
VIBES -- Vision Backbone Efficient Selection
Guerin, Joris, Bansal, Shray, Shaban, Amirreza, Mann, Paulo, Gazula, Harshvardhan
This work tackles the challenge of efficiently selecting high-performance pre-trained vision backbones for specific target tasks. Although exhaustive search within a finite set of backbones can solve this problem, it becomes impractical for large datasets and backbone pools. To address this, we introduce Vision Backbone Efficient Selection (VIBES), which aims to quickly find well-suited backbones, potentially trading off optimality for efficiency. We propose several simple yet effective heuristics to address VIBES and evaluate them across four diverse computer vision datasets. Our results show that these approaches can identify backbones that outperform those selected from generic benchmarks, even within a limited search budget of one hour on a single GPU. We reckon VIBES marks a paradigm shift from benchmarks to task-specific optimization.
A method to benchmark high-dimensional process drift detection
Process curves are multi-variate finite time series data coming from manufacturing processes. This paper studies machine learning methods for drifts of process curves. A theoretic framework to synthetically generate process curves in a controlled way is introduced in order to benchmark machine learning algorithms for process drift detection. A evaluation score, called the temporal area under the curve, is introduced, which allows to quantify how well machine learning models unveil curves belonging to drift segments. Finally, a benchmark study comparing popular machine learning approaches on synthetic data generated with the introduced framework shown.
OpenHEXAI: An Open-Source Framework for Human-Centered Evaluation of Explainable Machine Learning
Ma, Jiaqi, Lai, Vivian, Zhang, Yiming, Chen, Chacha, Hamilton, Paul, Ljubenkov, Davor, Lakkaraju, Himabindu, Tan, Chenhao
Recently, there has been a surge of explainable AI (XAI) methods driven by the need for understanding machine learning model behaviors in high-stakes scenarios. However, properly evaluating the effectiveness of the XAI methods inevitably requires the involvement of human subjects, and conducting human-centered benchmarks is challenging in a number of ways: designing and implementing user studies is complex; numerous design choices in the design space of user study lead to problems of reproducibility; and running user studies can be challenging and even daunting for machine learning researchers. To address these challenges, this paper presents OpenHEXAI, an open-source framework for human-centered evaluation of XAI methods. OpenHEXAI features (1) a collection of diverse benchmark datasets, pre-trained models, and post hoc explanation methods; (2) an easy-to-use web application for user study; (3) comprehensive evaluation metrics for the effectiveness of post hoc explanation methods in the context of human-AI decision making tasks; (4) best practice recommendations of experiment documentation; and (5) convenient tools for power analysis and cost estimation. OpenHEAXI is the first large-scale infrastructural effort to facilitate human-centered benchmarks of XAI methods. It simplifies the design and implementation of user studies for XAI methods, thus allowing researchers and practitioners to focus on the scientific questions. Additionally, it enhances reproducibility through standardized designs. Based on OpenHEXAI, we further conduct a systematic benchmark of four state-of-the-art post hoc explanation methods and compare their impacts on human-AI decision making tasks in terms of accuracy, fairness, as well as users' trust and understanding of the machine learning model.
Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence
McIntosh, Timothy R., Susnjak, Teo, Liu, Tong, Watters, Paul, Halgamuge, Malka N.
The rapid rise in popularity of Large Language Models (LLMs) with emerging capabilities has spurred public curiosity to evaluate and compare different LLMs, leading many researchers to propose their LLM benchmarks. Noticing preliminary inadequacies in those benchmarks, we embarked on a study to critically assess 23 state-of-the-art LLM benchmarks, using our novel unified evaluation framework through the lenses of people, process, and technology, under the pillars of functionality and security. Our research uncovered significant limitations, including biases, difficulties in measuring genuine reasoning, adaptability, implementation inconsistencies, prompt engineering complexity, evaluator diversity, and the overlooking of cultural and ideological norms in one comprehensive assessment. Our discussions emphasized the urgent need for standardized methodologies, regulatory certainties, and ethical guidelines in light of Artificial Intelligence (AI) advancements, including advocating for an evolution from static benchmarks to dynamic behavioral profiling to accurately capture LLMs' complex behaviors and potential risks. Our study highlighted the necessity for a paradigm shift in LLM evaluation methodologies, underlining the importance of collaborative efforts for the development of universally accepted benchmarks and the enhancement of AI systems' integration into society.
Structure-based out-of-distribution (OOD) materials property prediction: a benchmark study
Omee, Sadman Sadeed, Fu, Nihang, Dong, Rongzhi, Hu, Ming, Hu, Jianjun
In real-world material research, machine learning (ML) models are usually expected to predict and discover novel exceptional materials that deviate from the known materials. It is thus a pressing question to provide an objective evaluation of ML model performances in property prediction of out-of-distribution (OOD) materials that are different from the training set distribution. Traditional performance evaluation of materials property prediction models through random splitting of the dataset frequently results in artificially high performance assessments due to the inherent redundancy of typical material datasets. Here we present a comprehensive benchmark study of structure-based graph neural networks (GNNs) for extrapolative OOD materials property prediction. We formulate five different categories of OOD ML problems for three benchmark datasets from the MatBench study. Our extensive experiments show that current state-of-the-art GNN algorithms significantly underperform for the OOD property prediction tasks on average compared to their baselines in the MatBench study, demonstrating a crucial generalization gap in realistic material prediction tasks. We further examine the latent physical spaces of these GNN models and identify the sources of CGCNN, ALIGNN, and DeeperGATGNN's significantly more robust OOD performance than those of the current best models in the MatBench study (coGN and coNGN), and provide insights to improve their performance.
LOB-Based Deep Learning Models for Stock Price Trend Prediction: A Benchmark Study
Prata, Matteo, Masi, Giuseppe, Berti, Leonardo, Arrigoni, Viviana, Coletta, Andrea, Cannistraci, Irene, Vyetrenko, Svitlana, Velardi, Paola, Bartolini, Novella
The recent advancements in Deep Learning (DL) research have notably influenced the finance sector. We examine the robustness and generalizability of fifteen state-of-the-art DL models focusing on Stock Price Trend Prediction (SPTP) based on Limit Order Book (LOB) data. To carry out this study, we developed LOBCAST, an open-source framework that incorporates data preprocessing, DL model training, evaluation and profit analysis. Our extensive experiments reveal that all models exhibit a significant performance drop when exposed to new data, thereby raising questions about their real-world market applicability. Our work serves as a benchmark, illuminating the potential and the limitations of current approaches and providing insight for innovative solutions.
Interpretable Regional Descriptors: Hyperbox-Based Local Explanations
Dandl, Susanne, Casalicchio, Giuseppe, Bischl, Bernd, Bothmann, Ludwig
This work introduces interpretable regional descriptors, or IRDs, for local, model-agnostic interpretations. IRDs are hyperboxes that describe how an observation's feature values can be changed without affecting its prediction. They justify a prediction by providing a set of "even if" arguments (semi-factual explanations), and they indicate which features affect a prediction and whether pointwise biases or implausibilities exist. A concrete use case shows that this is valuable for both machine learning modelers and persons subject to a decision. We formalize the search for IRDs as an optimization problem and introduce a unifying framework for computing IRDs that covers desiderata, initialization techniques, and a post-processing method. We show how existing hyperbox methods can be adapted to fit into this unified framework. A benchmark study compares the methods based on several quality measures and identifies two strategies to improve IRDs.